2 Description of the data - EDA

We have data from three buildings located in Vienna, each equipped with solar panels. Their sensors record two quantities: the energy produced, in kW, and the sun radiation. The data was collected between 9 August 2016 and 1 July 2019 at 15-minute intervals.

Furthermore, we acquired weather data from https://www.worldweatheronline.com/developer/api/docs/historical-weather-api.aspx using the Python script importWeatherData.py, attached in the folder. The historical weather API lets us retrieve weather data for a specified time period and location, and it also supports retrieving data for multiple locations at once. In our case we only needed data for Vienna, because all three buildings are located there. We retrieved one record per 24 hours (one per day), because this was the most convenient resolution to work with.

2.1 Setup

# loading the libraries
suppressPackageStartupMessages({
  library(ggplot2)
  library(data.table)
  library(forecast)
  library(tidyr)
  library(lubridate)
  library(dplyr) 
  library(tseries)
  library(plotly)
  library(nortest)
  library(astsa)
})

Loading the data

# files with 'sun' in the name hold the sun radiation sensor data; the others hold the energy produced
building_2_sun <- readRDS("data/Building 2 sun.rds")
building_2     <- readRDS("data/Building 2.rds")
building_5_sun <- readRDS("data/Building 5 sun.rds")
building_5     <- readRDS("data/Building 5.rds")
building_8_sun <- readRDS("data/Building 8 sun.rds")
building_8     <- readRDS("data/Building 8.rds")

setDT(building_2_sun)
setDT(building_2)
setDT(building_5_sun)
setDT(building_5)
setDT(building_8_sun)
setDT(building_8)

setnames(building_2_sun, "1302611", "sun")
setnames(building_2, "1490017", "energy_produced")
setnames(building_5_sun, "1328370", "sun")
setnames(building_5, "1328347", "energy_produced")
setnames(building_8_sun, "1302169", "sun")
setnames(building_8, "1498763", "energy_produced")

# weather data
weather <- fread("data/vienna.csv", na.strings = c("No moonrise", "No moonset"))

2.2 Sun radiation

columns <- c("building_2_sun", "building_5_sun", "building_8_sun")
plots <- lapply(columns,
                function(col) {
                  plot_ly(data = get(col)[, .(sun=mean(sun)), by=.(timestamp=floor_date(timestamp, "weeks"))],
                          x    = ~timestamp,
                          y    = ~sun,
                          type = "scatter",
                          mode = "lines") %>%
                  layout(yaxis = list(title = paste("Sun radiation on", gsub("_sun", "", col), sep="\n")),
                         xaxis = list(showticklabels=TRUE),
                         showlegend=FALSE,
                         title="Weekly average sun radiation measured on each building")})
subplot(plots, titleY = TRUE, nrows = 3)

The sun radiation measured on each building is quite similar, which is consistent with the buildings being in close proximity to each other.

We can also see that sun radiation peaks in the middle of summer and is lowest in winter, which makes complete sense.

2.3 Energy production

columns <- c("building_2", "building_5", "building_8")
plots <- lapply(columns,
                function(col) {
                  plot_ly(data = get(col)[, .(energy_produced=sum(energy_produced)), by=.(timestamp=floor_date(timestamp, "weeks"))],
                          x    = ~timestamp,
                          y    = ~energy_produced,
                          type = "scatter",
                          mode = "lines") %>%
                    layout(yaxis = list(title = paste("Energy from", gsub("_", " ", as.character(col)), sep="\n")),
                           xaxis = list(showticklabels=TRUE),
                           showlegend=FALSE,
                           title="Weekly aggregated energy production of each building")})
subplot(plots, titleY = TRUE, nrows = 3)

As we can see, the data from building 2 deviates from that of buildings 5 and 8 because it contains some really extreme values, so we’ll take a closer look at the outliers there.

As with sun radiation, the most energy is produced in summer and the least in winter, which suggests a decent correlation between these two features.

2.4 Summary of all the datasets

Building 2

summary(building_2_sun[,sun])
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   -0.0867    0.0000    0.1733  130.8078  176.2042 1089.6600
summary(building_2[,energy_produced])
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -32716.85      0.00      0.00      0.49      0.55  32716.85

Building 5

summary(building_5_sun[,sun])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   -1.056    1.869   12.919  146.211  180.131 1286.025
summary(building_5[,energy_produced])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.591   2.032  18.560

Building 8

summary(building_8_sun[,sun])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    3.354    3.519    3.675  132.371  177.771 1068.678
summary(building_8[,energy_produced])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3009  0.3760  2.3920

As you can see, the range of energy_produced in building_2 differs drastically from that of the corresponding feature in the other two datasets: it spans -32716.85 to 32716.85, whereas building_5 and building_8 peak at 18.56 and 2.392. The median and the mean, however, are quite similar to the others.

Let us count the NAs, if there are any:

sum(is.na(building_2_sun))
## [1] 0
sum(is.na(building_2))
## [1] 0
sum(is.na(building_5_sun))
## [1] 0
sum(is.na(building_5))
## [1] 0
sum(is.na(building_8_sun))
## [1] 0
sum(is.na(building_8))
## [1] 0

No NAs found in any of the datasets!
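A zero NA count does not by itself show that the series is complete, since a missing 15-minute reading would show up as an absent row rather than as an NA. The following is a minimal sketch of a gap check in base R. The `readings` table and the rows removed from it are made up for illustration; only the column name `timestamp` is taken from the data above.

```r
# Toy 15-minute series (hypothetical data, not the real sensor readings)
full_grid <- seq(from = as.POSIXct("2016-08-09 00:00:00", tz = "UTC"),
                 by   = "15 min",
                 length.out = 96)                              # one full day
readings <- data.frame(timestamp = full_grid[-c(10, 11, 50)])  # drop 3 rows on purpose

# Time between consecutive readings, in minutes
step_min <- as.numeric(diff(readings$timestamp), units = "mins")

# Any step longer than 15 minutes marks a gap
gap_idx <- which(step_min > 15)
gaps <- data.frame(gap_start = readings$timestamp[gap_idx],      # last reading before the gap
                   gap_end   = readings$timestamp[gap_idx + 1],  # first reading after the gap
                   missing   = step_min[gap_idx] / 15 - 1)       # readings missing in between
n_missing <- sum(gaps$missing)

print(gaps)
n_missing
```

Applied to the real tables, an empty `gaps` result would confirm that every 15-minute slot between the first and last timestamp is present.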

Now let us look at some boxplots, which can reveal potential outliers:

par(mfrow=c(2,3))
boxplot(building_2[,energy_produced], main="Building 2 energy")
boxplot(building_5[,energy_produced], main="Building 5 energy")
boxplot(building_8[,energy_produced], main="Building 8 energy")
boxplot(building_2_sun[,sun], main="Building 2 sun")
boxplot(building_5_sun[,sun], main="Building 5 sun")
boxplot(building_8_sun[,sun], main="Building 8 sun")

All buildings have a number of outliers, especially above the upper whisker.
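To put numbers on what the boxplots show, the flagged points can be extracted with `boxplot.stats()`, which applies the same 1.5 × IQR whisker rule that `boxplot()` uses by default. This is only a sketch on a made-up vector (`energy_toy` is hypothetical); the two spikes merely mimic the extreme building 2 values.

```r
# Hypothetical energy readings: many zeros, some small values, two extreme spikes
energy_toy <- c(rep(0, 50), seq(0.1, 2, by = 0.1), 32716.85, -32716.85)

stats <- boxplot.stats(energy_toy)   # default coef = 1.5, as in boxplot()

stats$stats                   # lower whisker, lower hinge, median, upper hinge, upper whisker
n_outliers <- length(stats$out)
n_outliers                    # number of points that would be drawn individually
range(stats$out)              # how extreme the flagged points are
```

Note that for data with many zeros the box is very narrow, so ordinary daytime values can be flagged alongside genuinely implausible ones. A point beyond the whiskers is therefore a candidate for inspection, not automatically an error.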

Now let us look at the distributions of the data:

# kernel density estimates (descriptive names avoid masking base functions such as c())
dens_energy_2 <- density(building_2[,energy_produced])
dens_sun_2    <- density(building_2_sun[,sun])
dens_energy_5 <- density(building_5[,energy_produced])
dens_sun_5    <- density(building_5_sun[,sun])
dens_energy_8 <- density(building_8[,energy_produced])
dens_sun_8    <- density(building_8_sun[,sun])

fig1 <- plot_ly(x=dens_energy_2$x, y=dens_energy_2$y, type= "scatter", mode = "lines", fill = "tozeroy", name="Energy produced of building 2")
fig2 <- plot_ly(x=dens_sun_2$x, y=dens_sun_2$y, type= "scatter", mode = "lines", fill = "tozeroy", name="Sun radiation of building 2")
fig3 <- plot_ly(x=dens_energy_5$x, y=dens_energy_5$y, type= "scatter", mode = "lines", fill = "tozeroy", name="Energy produced of building 5")
fig4 <- plot_ly(x=dens_sun_5$x, y=dens_sun_5$y, type= "scatter", mode = "lines", fill = "tozeroy", name="Sun radiation of building 5")
fig5 <- plot_ly(x=dens_energy_8$x, y=dens_energy_8$y, type= "scatter", mode = "lines", fill = "tozeroy", name="Energy produced of building 8")
fig6 <- plot_ly(x=dens_sun_8$x, y=dens_sun_8$y, type= "scatter", mode = "lines", fill = "tozeroy", name="Sun radiation of building 8")

fig <- subplot(fig1, fig2, fig3, fig4, fig5, fig6, nrows = 6)
fig

The distributions do not look normal: the summaries above show medians well below the means, so most of the mass sits near zero with a long right tail of high values, which is what we saw in the boxplots as well. Also, the energy produced by building 2 has a very different range from the others, which is suspicious.
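Whether a series is close to normal can also be checked numerically rather than by eye. The sketch below uses only base R on a made-up radiation-like vector (`sun_toy` and the helper `sample_skewness` are hypothetical names): it compares mean and median, computes the sample skewness, and runs a Shapiro-Wilk test. The nortest package loaded in the setup offers further tests, such as `ad.test()`.

```r
# Hypothetical radiation-like series: zero at "night", positive during the "day"
sun_toy <- 1000 * pmax(0, sin(seq(0, 20 * pi, length.out = 960)))

# Moment-based sample skewness: about 0 for symmetric data, > 0 for a long right tail
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

mean_val   <- mean(sun_toy)
median_val <- median(sun_toy)
skew_val   <- sample_skewness(sun_toy)

# Shapiro-Wilk test (accepts 3 to 5000 observations); a small p-value rejects normality
shapiro_p <- shapiro.test(sun_toy)$p.value

c(mean = mean_val, median = median_val, skewness = skew_val, shapiro_p = shapiro_p)
```

Because `shapiro.test()` is limited to 5000 observations, the real 15-minute series would have to be subsampled or aggregated first; that step is not shown here.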

Plotting sun radiation vs energy produced on one plot:

building_5_sun %>%
  plot_ly(
    x=~timestamp,
    y=~sun,
    type="scatter",
    mode="lines",
    name="sun radiation",
    line = list(color='#ff7f0e')
  ) %>%
  add_trace(
    inherit = FALSE,
    data=building_5,
    x=~timestamp,
    y=~energy_produced,
    type="scatter",
    mode="lines",
    name="energy produced",
    yaxis = "y2",
    line = list(color = '#1f77b4')
  ) %>%
  layout(
    title = "Building 5",
    yaxis2 = list(
      tickfont = list(color = '#1f77b4'), # same color as the energy trace drawn on this axis
      overlaying = "y",
      side = "right",
      title = "second y axis - energy"
    )
  )

Because the two series are on different scales, we plotted them together using a separate y axis for each. The plot clearly shows that energy production rises and falls with sun radiation, so the two are correlated!

But are there other factors that could influence energy production?
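The visual impression can be backed by a correlation coefficient. The following sketch, in base R on made-up data, aligns two series on their shared timestamps and computes Pearson and Spearman correlations. The tables `sun_toy` and `energy_toy` and the slightly nonlinear response are assumptions for illustration; only the column names follow the data above.

```r
# Hypothetical 15-minute readings for one day (toy data, not the real sensors)
ts <- seq(as.POSIXct("2018-06-01 00:00:00", tz = "UTC"), by = "15 min", length.out = 96)
daylight <- pmax(0, sin((seq_along(ts) - 24) / 48 * pi))  # 0 at night, peak around midday

sun_toy    <- data.frame(timestamp = ts, sun = 900 * daylight)
energy_toy <- data.frame(timestamp = ts, energy_produced = 15 * daylight^1.2)

# Align the two series on their shared timestamps before comparing them
both <- merge(sun_toy, energy_toy, by = "timestamp")

pearson  <- cor(both$sun, both$energy_produced)                       # linear association
spearman <- cor(both$sun, both$energy_produced, method = "spearman")  # rank-based
c(pearson = pearson, spearman = spearman)
```

Merging on `timestamp` first matters when the two tables do not cover exactly the same time slots, because `cor()` compares the vectors position by position.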

2.5 Weather data description

Let’s examine what it looks like:

summary(weather)
##    date_time             maxtempC       mintempC        totalSnow_cm        sunHour         uvIndex      moon_illumination   moonrise           moonset         
##  Min.   :2016-08-09   Min.   :-8.0   Min.   :-13.000   Min.   : 0.0000   Min.   : 3.20   Min.   :1.000   Min.   :  0.00    Length:1057        Length:1057       
##  1st Qu.:2017-04-30   1st Qu.: 7.0   1st Qu.:  2.000   1st Qu.: 0.0000   1st Qu.: 7.70   1st Qu.:2.000   1st Qu.: 19.00    Class :character   Class :character  
##  Median :2018-01-19   Median :15.0   Median :  7.000   Median : 0.0000   Median :10.30   Median :3.000   Median : 46.00    Mode  :character   Mode  :character  
##  Mean   :2018-01-19   Mean   :15.1   Mean   :  7.102   Mean   : 0.1435   Mean   :10.09   Mean   :3.426   Mean   : 46.25                                         
##  3rd Qu.:2018-10-10   3rd Qu.:23.0   3rd Qu.: 13.000   3rd Qu.: 0.0000   3rd Qu.:13.50   3rd Qu.:5.000   3rd Qu.: 73.00                                         
##  Max.   :2019-07-01   Max.   :37.0   Max.   : 22.000   Max.   :18.3000   Max.   :14.50   Max.   :8.000   Max.   :100.00                                         
##    sunrise             sunset            DewPointC         FeelsLikeC        HeatIndexC       WindChillC       WindGustKmph     cloudcover        humidity    
##  Length:1057        Length:1057        Min.   :-15.000   Min.   :-18.000   Min.   :-10.00   Min.   :-18.000   Min.   : 4.00   Min.   :  0.00   Min.   :37.00  
##  Class :character   Class :character   1st Qu.:  0.000   1st Qu.:  1.000   1st Qu.:  4.00   1st Qu.:  1.000   1st Qu.:13.00   1st Qu.: 16.00   1st Qu.:65.00  
##  Mode  :character   Mode  :character   Median :  6.000   Median :  9.000   Median : 11.00   Median :  9.000   Median :19.00   Median : 34.00   Median :72.00  
##                                        Mean   :  5.585   Mean   :  9.209   Mean   : 11.04   Mean   :  9.066   Mean   :20.04   Mean   : 39.53   Mean   :72.16  
##                                        3rd Qu.: 12.000   3rd Qu.: 18.000   3rd Qu.: 18.00   3rd Qu.: 18.000   3rd Qu.:26.00   3rd Qu.: 61.00   3rd Qu.:80.00  
##                                        Max.   : 20.000   Max.   : 30.000   Max.   : 30.00   Max.   : 29.000   Max.   :54.00   Max.   :100.00   Max.   :99.00  
##     precipMM         pressure        tempC        visibility     winddirDegree   windspeedKmph     location        
##  Min.   : 0.000   Min.   : 994   Min.   :-8.0   Min.   : 0.000   Min.   : 23.0   Min.   : 3.00   Length:1057       
##  1st Qu.: 0.000   1st Qu.:1013   1st Qu.: 7.0   1st Qu.: 9.000   1st Qu.:154.0   1st Qu.: 8.00   Class :character  
##  Median : 0.300   Median :1018   Median :15.0   Median :10.000   Median :227.0   Median :12.00   Mode  :character  
##  Mean   : 2.542   Mean   :1018   Mean   :15.1   Mean   : 9.202   Mean   :224.9   Mean   :13.06                     
##  3rd Qu.: 2.800   3rd Qu.:1023   3rd Qu.:23.0   3rd Qu.:10.000   3rd Qu.:291.0   3rd Qu.:17.00                     
##  Max.   :45.400   Max.   :1043   Max.   :37.0   Max.   :10.000   Max.   :350.0   Max.   :38.00
str(weather)
## Classes 'data.table' and 'data.frame':   1057 obs. of  25 variables:
##  $ date_time        : IDate, format: "2016-08-09" "2016-08-10" "2016-08-11" "2016-08-12" ...
##  $ maxtempC         : int  26 14 19 19 25 29 26 25 23 26 ...
##  $ mintempC         : int  15 12 9 9 14 15 15 14 15 13 ...
##  $ totalSnow_cm     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ sunHour          : num  12.4 10.3 14.5 12.4 12.4 14.5 14.5 13.4 14.5 14.5 ...
##  $ uvIndex          : int  5 3 5 4 5 6 5 5 5 6 ...
##  $ moon_illumination: int  39 46 53 61 68 75 82 90 97 100 ...
##  $ moonrise         : chr  "12:29 PM" "01:29 PM" "02:27 PM" "03:26 PM" ...
##  $ moonset          : chr  "11:21 PM" "11:51 PM" NA "12:23 AM" ...
##  $ sunrise          : chr  "05:41 AM" "05:43 AM" "05:44 AM" "05:45 AM" ...
##  $ sunset           : chr  "08:18 PM" "08:16 PM" "08:15 PM" "08:13 PM" ...
##  $ DewPointC        : int  14 12 7 9 14 15 14 12 11 11 ...
##  $ FeelsLikeC       : int  21 12 12 14 18 23 21 19 18 19 ...
##  $ HeatIndexC       : int  21 13 14 15 19 23 21 19 18 19 ...
##  $ WindChillC       : int  20 12 12 14 18 22 20 19 18 19 ...
##  $ WindGustKmph     : int  17 21 26 19 13 9 15 6 14 14 ...
##  $ cloudcover       : int  46 76 20 49 62 13 24 56 36 22 ...
##  $ humidity         : int  68 90 66 68 75 66 68 67 67 63 ...
##  $ precipMM         : num  6.1 28.2 0.1 0.1 2 2.3 1.8 1.7 1.5 0.1 ...
##  $ pressure         : int  1017 1018 1022 1025 1025 1022 1021 1017 1013 1012 ...
##  $ tempC            : int  26 14 19 19 25 29 26 25 23 26 ...
##  $ visibility       : int  10 8 10 10 8 10 10 10 10 10 ...
##  $ winddirDegree    : int  283 328 314 275 237 244 149 64 287 173 ...
##  $ windspeedKmph    : int  12 14 17 13 7 6 9 4 10 8 ...
##  $ location         : chr  "vienna" "vienna" "vienna" "vienna" ...
##  - attr(*, ".internal.selfref")=<externalptr>

We can see that this dataset includes many features, such as maximum and minimum temperature, wind data, moon data and precipitation, which together form a complete overview of the weather in Vienna. Not all of these features are relevant for our purpose, but we will deal with this in the preprocessing part.

Check if there are missing values:

colSums(is.na(weather))
##         date_time          maxtempC          mintempC      totalSnow_cm           sunHour           uvIndex moon_illumination          moonrise           moonset 
##                 0                 0                 0                 0                 0                 0                 0                36                36 
##           sunrise            sunset         DewPointC        FeelsLikeC        HeatIndexC        WindChillC      WindGustKmph        cloudcover          humidity 
##                 0                 0                 0                 0                 0                 0                 0                 0                 0 
##          precipMM          pressure             tempC        visibility     winddirDegree     windspeedKmph          location 
##                 0                 0                 0                 0                 0                 0                 0

We have some missing values in our weather dataset. However, they are not really relevant, as they only occur in the columns “moonrise” and “moonset,” where they stem from the “No moonrise” and “No moonset” entries that we mapped to NA when reading the file. We will deal with this later.

We assumed that the weather dataset might contain other factors that could influence the energy production, such as cloud cover, the number of sunny hours and UV radiation.

Let’s plot how cloud cover relates to the measured sun radiation:

weather[, .(avg=mean(cloudcover)), by=.(date=floor_date(date_time, "weeks"))] %>%
  plot_ly(
    x=~date,
    y=~avg,
    name="Weekly average\ncloudcover",
    type="scatter",
    mode="lines"
  ) %>%
  add_trace(
    inherit = FALSE,
    data=building_5_sun[, .(weekly_avg=mean(sun)), by=.(week=floor_date(timestamp, "weeks"))],
    x=~week,
    y=~weekly_avg,
    type="scatter",
    mode="lines",
    name="Weekly average\nsun radiation"
  ) %>%
  layout(title="Cloudcover vs Sun Radiation",
         yaxis = list(title = " "),
         xaxis = list(title = " "))

The plot above shows the weekly average cloudcover and sun radiation, which come from two different datasets: one is provided by our sensor, the other by a public API. Even though the two are on different scales, higher cloudcover coincides with less sunshine; in other words, they are inversely related. This suggests that our weather data is consistent with the sensor measurements.
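Since the weather data is daily and the sensor data arrives every 15 minutes, a numerical comparison needs both at the same resolution. The following sketch, in base R on made-up data, collapses a sensor series to daily means, joins it to a daily weather table by date, and computes the correlation. `sensor_toy`, `weather_toy` and all values are hypothetical; only the column names `timestamp`, `sun`, `date_time` and `cloudcover` follow the data above.

```r
# Hypothetical data: 4 days of 15-minute radiation and one cloud cover value per day
ts <- seq(as.POSIXct("2018-06-01 00:00:00", tz = "UTC"), by = "15 min", length.out = 4 * 96)
clouds    <- c(10, 80, 40, 95)                        # percent
day_shape <- pmax(0, sin(((0:95) - 24) / 48 * pi))    # 0 at night, peak around midday
sun       <- rep(1 - clouds / 100, each = 96) * 1000 * rep(day_shape, times = 4)

sensor_toy  <- data.frame(timestamp = ts, sun = sun)
weather_toy <- data.frame(date_time = as.Date("2018-06-01") + 0:3, cloudcover = clouds)

# Collapse the sensor data to one row per calendar day, then join on the date
sensor_toy$date_time <- as.Date(sensor_toy$timestamp, tz = "UTC")
daily  <- aggregate(sun ~ date_time, data = sensor_toy, FUN = mean)
joined <- merge(daily, weather_toy, by = "date_time")

cloud_cor <- cor(joined$sun, joined$cloudcover)
cloud_cor   # negative when more cloud goes with less radiation
```

On the real data the relationship will be noisier than in this toy example, where radiation is constructed as a direct function of cloud cover.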